Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Chengwu Duan (Jason), UID 606332825

Display machine information for reproducibility:

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.1.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.1.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.33     rlang_1.1.2       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:  7.657 GiB 
Freeram:   3.439 GiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 4429876
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   15516088 Jan 12 12:46 admissions.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu     427468 Jan 12 12:46 d_hcpcs.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu     859438 Jan 12 12:46 d_icd_diagnoses.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu     578517 Jan 12 12:46 d_icd_procedures.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu      12900 Jan 12 12:46 d_labitems.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   25070720 Jan 12 12:46 diagnoses_icd.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu    7426955 Jan 12 12:46 drgcodes.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu  508524623 Jan 12 12:46 emar.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu  471096030 Jan 12 12:46 emar_detail.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu    1767138 Jan 12 12:47 hcpcsevents.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu 1939088924 Jan 12 12:47 labevents.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   96698496 Jan 12 12:47 microbiologyevents.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   36124944 Jan 12 12:47 omr.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu    2312631 Jan 12 12:47 patients.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu  398753125 Jan 12 12:48 pharmacy.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu  498505135 Jan 12 12:48 poe.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   25477219 Jan 12 12:48 poe_detail.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu  458817415 Jan 12 12:48 prescriptions.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu    6027067 Jan 12 12:48 procedures_icd.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu     122507 Jan 12 12:48 provider.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu    6781247 Jan 12 12:48 services.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   36158338 Jan 12 12:48 transfers.csv.gz
ls -l ~/mimic/icu/
total 3077984
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu      35893 Jan 12 12:50 caregiver.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu 2467761053 Jan 12 12:50 chartevents.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu      57476 Jan 12 12:50 d_items.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   45721062 Jan 12 12:50 datetimeevents.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu    2614571 Jan 12 12:50 icustays.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu  251962313 Jan 12 12:50 ingredientevents.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu  324218488 Jan 12 12:50 inputevents.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   38747895 Jan 12 12:50 outputevents.csv.gz
-rwxrwxrwx 1 jasonduanchengwu jasonduanchengwu   20717852 Jan 12 12:50 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Answer:

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)

cat("Time taken for read.csv: ", "\n")
Time taken for read.csv:  
system.time(
  read.csv("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  5.595   0.119   5.975 
cat("Time taken for read_csv from tidyverse: ", "\n")
Time taken for read_csv from tidyverse:  
system.time(
  read_csv("~/mimic/hosp/admissions.csv.gz", 
           show_col_types = F))
   user  system elapsed 
  1.279   0.116   1.071 
cat("Time taken for fread from data.table: ", "\n")
Time taken for fread from data.table:  
system.time(
  fread("~/mimic/hosp/admissions.csv.gz"))
   user  system elapsed 
  0.823   0.068   0.731 
cat("data type for read.csv: ", "\n")
data type for read.csv:  
class(
  read.csv("~/mimic/hosp/admissions.csv.gz"))
[1] "data.frame"
cat("data type for read_csv from tidyverse: ", "\n")
data type for read_csv from tidyverse:  
class(
  read_csv("~/mimic/hosp/admissions.csv.gz", show_col_types = F))
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
cat("data type for fread from data.table: ", "\n")
data type for fread from data.table:  
class(
  fread("~/mimic/hosp/admissions.csv.gz"))
[1] "data.table" "data.frame"
cat("memory usage for read.csv: ", "\n")
memory usage for read.csv:  
object_size(
  read.csv("~/mimic/hosp/admissions.csv.gz"))
158,710,640 B
cat("memory usage for read_csv from tidyverse: ", "\n")
memory usage for read_csv from tidyverse:  
object_size(
  read_csv("~/mimic/hosp/admissions.csv.gz", show_col_types = F))
55,309,384 B
cat("memory usage for fread from data.table: ", "\n")
memory usage for fread from data.table:  
object_size(
  fread("~/mimic/hosp/admissions.csv.gz"))
50,129,376 B
# clear current workspace to ease up memory
rm(list = ls())

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

Answer: fread is the fastest. There are subtle differences in the parsed data types: read.csv returns a data.frame, read_csv returns a tibble (classes “spec_tbl_df”, “tbl_df”, “tbl”, “data.frame”), and fread returns a data.table. All three inherit from data.frame. The memory usage of each result is shown in the output above: read.csv uses the most (about 159 MB), while read_csv (about 55 MB) and fread (about 50 MB) are much more compact.
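To make the shared inheritance concrete, here is a minimal self-contained sketch (a toy one-column table, not the admissions file) showing that all three return types inherit from data.frame:

```r
library(data.table)
library(tibble)

# the same one-column table in each of the three representations
df <- read.csv(text = "x\n1\n2")   # base R data.frame
dt <- as.data.table(df)            # data.table
tb <- as_tibble(df)                # tibble

# all three inherit from data.frame
sapply(list(df, dt, tb), inherits, what = "data.frame")
```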

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)

columns = cols(
  subject_id = col_integer(),
  hadm_id = col_integer(),
  admission_type = col_factor(),
  admission_location = col_factor(),
  discharge_location = col_factor(),
  insurance = col_factor(),
  language = col_factor(),
  marital_status = col_factor(),
  race = col_factor(),
  hospital_expire_flag = col_factor()
)
read_csv("~/mimic/hosp/admissions.csv.gz", 
         col_types = columns)
# A tibble: 431,231 × 16
   subject_id  hadm_id admittime           dischtime          
        <int>    <int> <dttm>              <dttm>             
 1   10000032 22595853 2180-05-06 22:23:00 2180-05-07 17:15:00
 2   10000032 22841357 2180-06-26 18:27:00 2180-06-27 18:49:00
 3   10000032 25742920 2180-08-05 23:44:00 2180-08-07 17:50:00
 4   10000032 29079034 2180-07-23 12:35:00 2180-07-25 17:55:00
 5   10000068 25022803 2160-03-03 23:16:00 2160-03-04 06:26:00
 6   10000084 23052089 2160-11-21 01:56:00 2160-11-25 14:52:00
 7   10000084 29888819 2160-12-28 05:11:00 2160-12-28 16:07:00
 8   10000108 27250926 2163-09-27 23:17:00 2163-09-28 09:04:00
 9   10000117 22927623 2181-11-15 02:05:00 2181-11-15 14:52:00
10   10000117 27988844 2183-09-18 18:10:00 2183-09-21 16:30:00
# ℹ 431,221 more rows
# ℹ 12 more variables: deathtime <dttm>, admission_type <fct>,
#   admit_provider_id <chr>, admission_location <fct>,
#   discharge_location <fct>, insurance <fct>, language <fct>,
#   marital_status <fct>, race <fct>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <fct>
cat("Run time of read_csv default\n")
Run time of read_csv default
system.time(
  read_csv("~/mimic/hosp/admissions.csv.gz", 
                     show_col_types = F))
   user  system elapsed 
  1.233   0.106   0.981 
cat("memory usage of read_csv default\n")
memory usage of read_csv default
object_size(
  read_csv("~/mimic/hosp/admissions.csv.gz", 
                     show_col_types = F))
55,309,384 B
cat("Run time of read_csv with columns specified\n")
Run time of read_csv with columns specified
system.time(
  read_csv("~/mimic/hosp/admissions.csv.gz", 
                     col_types = columns, show_col_types = F))
   user  system elapsed 
  1.164   0.208   1.006 
cat("memory usage of read_csv with columns specified\n")
memory usage of read_csv with columns specified
object_size(
  read_csv("~/mimic/hosp/admissions.csv.gz", 
                     col_types = columns, show_col_types = F))
38,064,592 B
# clear current workspace to ease up memory
rm(list = ls())

Answer:

The run time with column types specified is slightly faster (most of the time), and the memory usage is much smaller. We specified that the IDs be stored as integers, since they contain no fractional parts and fit within 32 bits; an integer occupies 4 bytes versus 8 bytes for a double. We also read the categorical character columns as factors: a character vector stores each string value in full, while a factor stores the unique strings once as levels and represents each observation by an integer code. The exact memory usages are shown in the output above.
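Both claims can be checked with a minimal self-contained sketch (toy vectors, not the admissions columns):

```r
# an integer occupies 4 bytes per element, a double 8 bytes
x_int <- rep(1L, 100000)
x_dbl <- rep(1.0, 100000)
object.size(x_int)   # roughly 400 KB
object.size(x_dbl)   # roughly 800 KB

# a factor stores one integer code per element plus a short vector of
# unique levels, so for long repetitive strings it beats a character vector
races <- rep(c("WHITE", "BLACK", "ASIAN", "OTHER"), 25000)
object.size(races)
object.size(factor(races))   # roughly half the character version
```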

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

library(tidyverse)

system.time(
  read_csv("~/mimic/hosp/labevents.csv.gz"))

# clear current workspace to ease up memory
rm(list = ls())

Answer: read_csv ran for a long time without finishing, so I aborted it after more than 5 minutes. The decompressed file is too large to fit in my machine’s memory.
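One workaround worth noting (shown here on a small synthetic file, since the real one will not fit in memory) is the n_max argument of read_csv, which caps how many rows are ingested so memory stays bounded while you inspect the file:

```r
library(readr)
library(tibble)

# a small gzipped CSV standing in for a file too large to load fully
tmp <- tempfile(fileext = ".csv.gz")
write_csv(tibble(id = 1:1000, val = rnorm(1000)), tmp)

# read only the first 5 rows to inspect the columns without loading everything
peek <- read_csv(tmp, n_max = 5, show_col_types = FALSE)
nrow(peek)  # 5
```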

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

library(tidyverse)

system.time(
  read_csv("~/mimic/hosp/labevents.csv.gz", 
                     col_select = c(subject_id, itemid,
                                    charttime, valuenum)))

# clear current workspace to ease up memory
rm(list = ls())

Answer: This did not solve the issue. My Windows machine runs this code chunk through RStudio Server in WSL Ubuntu, which has only 6.8 GB of RAM available; the ingestion failed in all 3 attempts.

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

# to check the fields
zcat < ~/mimic/hosp/labevents.csv.gz | head -n 10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,
zcat < ~/mimic/hosp/labevents.csv.gz | awk -F, 'BEGIN {OFS=","} NR==1 \
|| $5==50912 || $5==50971 || $5==50983 || $5==50902 || $5==50882 \
|| $5==51221 || $5==51301 || $5==50931 {print $2, $5, $7, $10}' \
| gzip > labevents_filtered.csv.gz
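The zcat | awk | gzip pattern can be sanity-checked on a tiny synthetic file (a made-up toy layout, not the real labevents schema): keep the header line plus the rows whose code field matches a wanted value, printing only selected columns.

```shell
# build a 3-row gzipped CSV with an id, a code, and a value column
printf 'id,code,val\n1,50912,0.4\n2,99999,7\n3,50971,3.7\n' | gzip > toy.csv.gz

# keep the header (NR==1) and rows whose 2nd field is a wanted code,
# emitting only the id and val columns, re-compressed with gzip
zcat < toy.csv.gz | awk -F, 'BEGIN {OFS=","} NR==1 \
|| $2==50912 || $2==50971 {print $1, $3}' \
| gzip > toy_filtered.csv.gz

zcat < toy_filtered.csv.gz
```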

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

Answer: The answers are displayed in the output below.

echo "First 10 lines of the file excluding the header row:"
zcat < labevents_filtered.csv.gz | tail -n +2 | head -n 10

echo "Number of lines in the new file:"
zcat < labevents_filtered.csv.gz | wc -l
First 10 lines of the file excluding the header row:
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5
Number of lines in the new file:
24855910
library(tidyverse)

cat("Run time for read_csv to ingest labevents_filtered.csv.gz\n")
Run time for read_csv to ingest labevents_filtered.csv.gz
system.time(
  read_csv("labevents_filtered.csv.gz", show_col_types = F))
   user  system elapsed 
 37.386   8.963  11.097 
# clear current workspace to ease up memory
rm(list = ls())

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

gunzip -c ~/mimic/hosp/labevents.csv.gz > labevents.csv

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

# clear current workspace to ease up memory
rm(list = ls())
library(arrow)
library(dplyr)
library(lubridate)

# measure time
cat("Time taken for the ingest+select+filter process\n")
Time taken for the ingest+select+filter process
system.time({
  labevents = open_dataset("labevents.csv", format = "csv")
  
  # select columns and filter using dplyr
  labevents_filtered = labevents %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50912, 50971, 50983, 50902, 
                         50882, 51221, 51301, 50931))
})
   user  system elapsed 
  0.110   0.012   0.129 
# converting arrow dataset object into tibble
labevents_tibble = as_tibble(labevents_filtered)
labevents_tibble$charttime = with_tz(labevents_tibble$charttime, 
                                     tzone = "America/Los_Angeles")
library(arrow)
library(tidyverse)

# number of rows
# add 1 to match the line count from wc -l in Q2.3, which included the header
nrow(labevents_filtered) + 1
[1] 24855910
# cat first 10 rows
labevents_filtered %>%
  head(10) %>%
  collect() %>%
  print()
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  
# clear current workspace to ease up memory
rm(list = ls())
Sys.timezone()

Disclaimer: due to a timezone difference, the time portion of charttime differs from the Q2.3 output. Aside from that, the rows match and the two results have the same number of rows.

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.

Answer: Apache Arrow is like an experienced driver who moves passengers (in this case, data) quickly and smoothly between destinations. It is also a great communicator who speaks many languages and knows the streets (computing environments) well. In practical terms, it lets different systems share and process data quickly and efficiently, and it can handle diverse data formats.
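Beyond the elevator pitch, a minimal sketch (using the built-in mtcars data, nothing from MIMIC) shows the workflow: data sit in Arrow’s columnar format, dplyr verbs are translated into Arrow compute operations, and nothing is materialized in R until collect():

```r
library(arrow)
library(dplyr)

# an Arrow Table: the same data as mtcars, in Arrow's columnar memory format
tbl <- arrow_table(mtcars)

# dplyr verbs run on the Arrow side; collect() brings back an R tibble
result <- tbl %>%
  filter(cyl == 4) %>%
  select(mpg, cyl) %>%
  collect()
nrow(result)  # 11 four-cylinder cars
```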

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

# clear current workspace to ease up memory
rm(list = ls())
library(arrow)
library(dplyr)

# write data as parquet format
labevents_p = open_dataset("labevents.csv", format = "csv")
write_dataset(labevents_p, "labevents_parquet")
library(arrow)
library(tidyverse)

# measure time
cat("Time taken for the ingest+select+filter process\n")
Time taken for the ingest+select+filter process
system.time({
  lab_temp = open_dataset("labevents_parquet", format = "parquet")
  
  # select columns and filter using dplyr
  labevents_parquet = lab_temp %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50912, 50971, 50983, 50902, 
                         50882, 51221, 51301, 50931))
})
   user  system elapsed 
  0.235   0.011   0.252 
library(arrow)
library(tidyverse)

# number of rows
# add 1 to match the line count from wc -l in Q2.3, which included the header
nrow(labevents_parquet) + 1
[1] 24855910
# first 10 rows
labevents_parquet %>%
  head(10) %>%
  collect() %>%
  print()
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  
# clear current workspace to ease up memory
rm(list = ls())

Disclaimer: due to a timezone difference, the time portion of charttime differs from the Q2.3 output. Aside from that, the rows match and the two results have the same number of rows.

# clear current workspace to ease up memory
rm(list = ls())
library(arrow)
library(tidyverse)

# size of all arrow dataset
calculate_directory_size = function(dir_path) {
  # List all files in the directory, including subdirectories
  files = list.files(dir_path, 
                      recursive = TRUE, full.names = TRUE)
  
  # Get the size of each file
  file_sizes = file.info(files)$size
  
  # Calculate the total size
  total_size = sum(file_sizes)
  
  # Convert to more readable units (e.g., GB)
  total_size_gb = total_size / (1024^3)
  
  return(total_size_gb)
}

# compute the total size of the Parquet dataset directory
dataset_size_gb = calculate_directory_size("./labevents_parquet")

# Print the size in GB
print(paste("Dataset size:", dataset_size_gb, "GB"))
[1] "Dataset size: 1.92787522915751 GB"
# clear current workspace to ease up memory
rm(list = ls())

The dataset size matches what the file manager reports.

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.

Answer: The Parquet format is like a smart bookshelf for digital data: everything is organized for selective access, so you can grab exactly what you want while saving space. Because data are stored column by column and compressed, it saves both time and storage.
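A minimal sketch with built-in data (not the labevents file) shows the selective access: write mtcars to a Parquet file, then read back only two columns. Because Parquet stores data by column, only those columns need to be read from disk.

```r
library(arrow)

pq <- tempfile(fileext = ".parquet")
write_parquet(mtcars, pq)

# col_select pulls just the requested columns out of the file
two_cols <- read_parquet(pq, col_select = c(mpg, cyl))
dim(two_cols)  # 32 rows, 2 columns
```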

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

# clear current workspace to ease up memory
rm(list = ls())
library(duckdb)
library(arrow)

library(tidyverse)

# measure time
cat("Time taken for the ingest+select+filter process\n")
Time taken for the ingest+select+filter process
system.time({
  lab_temp = open_dataset("labevents_parquet", format = "parquet")
  conn = dbConnect(duckdb())
  labevents = to_duckdb(lab_temp, con = conn)
  # select columns and filter using dplyr
  labevents_duckdb = labevents %>%
    select(subject_id, itemid, charttime, valuenum) %>%
    filter(itemid %in% c(50912, 50971, 50983, 50902, 
                         50882, 51221, 51301, 50931))
})
   user  system elapsed 
  0.403   0.055   0.553 
library(tidyverse)

# the database query is lazy, so it is only resolved when collected
# number of rows; add 1 to match the wc -l count in Q2.3, which included the header

nrow(as_tibble(labevents_duckdb)) + 1
[1] 24855910
# first 10 rows
labevents_duckdb %>%
  head(10) %>%
  collect() %>%
  print()
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 11:51:00     27  
 2   10000032  50902 2180-03-23 11:51:00    101  
 3   10000032  50912 2180-03-23 11:51:00      0.4
 4   10000032  50971 2180-03-23 11:51:00      3.7
 5   10000032  50983 2180-03-23 11:51:00    136  
 6   10000032  50931 2180-03-23 11:51:00     95  
 7   10000032  51221 2180-03-23 11:51:00     45.4
 8   10000032  51301 2180-03-23 11:51:00      3  
 9   10000032  51221 2180-05-06 22:25:00     42.6
10   10000032  51301 2180-05-06 22:25:00      5  
# This will close the connection and the in-memory database will be lost
dbDisconnect(conn)

# clear current workspace to ease up memory
rm(list = ls())

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.

Answer: DuckDB is like a calculator and filing cabinet in one: it can store large amounts of data and runs directly inside applications such as RStudio. It supports SQL and complex analytical queries, which makes it useful for data manipulation and management.
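As a concrete sketch (an in-memory database and the built-in mtcars data, nothing from MIMIC): DuckDB runs entirely inside the R session, and plain SQL can be issued against a registered table.

```r
library(duckdb)

# an in-memory DuckDB database living inside this R session
con <- dbConnect(duckdb())
dbWriteTable(con, "cars", mtcars)

# standard SQL, executed in-process
res <- dbGetQuery(con,
  "SELECT cyl, COUNT(*) AS n FROM cars GROUP BY cyl ORDER BY cyl")
res  # three rows: cyl 4, 6, 8
dbDisconnect(con, shutdown = TRUE)
```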

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Disclaimer: the decompressed chartevents.csv is 28.1 GB, so I prefer the Q2.3 method, which uses Bash to create a subset directly in .gz format from the .gz source. This is far more storage efficient than first decompressing to .csv and then performing the same process.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

zcat < ~/mimic/icu/chartevents.csv.gz | awk -F, 'BEGIN {OFS=","} NR==1 \
|| $7==220045 || $7==220181 || $7==220179 || $7==223761 \
|| $7==220210 {print}' \
| gzip > chartevents_filtered.csv.gz
echo "number of rows including the header row"
zcat < chartevents_filtered.csv.gz | wc -l
number of rows including the header row
22502320
echo "first 10 data rows of chartevents_filtered.csv.gz, plus the header row"
zcat < chartevents_filtered.csv.gz | head -n 11
first 10 data rows of chartevents_filtered.csv.gz, plus the header row
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220045,97,97,bpm,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220179,93,93,mmHg,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220181,56,56,mmHg,0
10000032,29079034,39553978,66056,2180-07-23 19:00:00,2180-07-23 19:59:00,220210,16,16,insp/min,0